5 Assignment 2
Instructions
You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.
Write your code in the Code cells and your answers in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough for the graders to understand and follow.
Use Quarto to render the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command:
quarto render filename.ipynb --to html
Submit the HTML file. The assignment is worth 100 points, and is due on Sunday, 4th February 2024 at 11:59 pm.
Five points are for properly formatting the assignment. The breakdown is as follows:
- Must be an HTML file rendered using Quarto (1 point). If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the ipynb file.
- No name can be written on the assignment, nor can there be any indicator of the student’s identity—e.g. printouts of the working directory should not be included in the final submission. (1 point)
- There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 point)
- Final answers to each question are written in the Markdown cells. (1 point)
- There is no piece of unnecessary / redundant code, and no unnecessary / redundant text. (1 point)
The maximum possible score on the assignment is 105 + 5 (proper formatting) = 110 out of 100.
5.1 1) Multiple Linear Regression (24 points)
A study was conducted on 97 male patients with prostate cancer who were due to receive a radical prostatectomy (complete removal of the prostate). The prostate.csv file contains data on 9 measurements taken from these 97 patients. Each row (observation) represents a patient and each column (variable) represents a measurement. The description of variables can be found here: https://rafalab.github.io/pages/649/prostate.html
5.1.1 1a)
Fit a linear regression model with lpsa as the response and all the other variables as the predictors. Print its summary. (2 points) Write down the optimal equation that predicts lpsa using the predictors. (2 points)
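As a starting point, the fitting step can be sketched with the statsmodels formula API. This is a minimal sketch on synthetic data: the data frame `demo` and its columns `y`, `x1`, and `x2` are hypothetical stand-ins, not the prostate data, and the coefficients it prints say nothing about the real answer.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical columns y, x1, x2), NOT the prostate data.
rng = np.random.default_rng(0)
n = 97
demo = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
demo["y"] = 1.0 + 2.0 * demo["x1"] - 0.5 * demo["x2"] + rng.normal(scale=0.5, size=n)

# Build "y ~ x1 + x2" from the column names so every non-response column is a predictor.
predictors = [col for col in demo.columns if col != "y"]
formula = "y ~ " + " + ".join(predictors)

model = smf.ols(formula, data=demo).fit()
print(model.summary())

# The fitted equation is: y_hat = Intercept + sum(coefficient * predictor).
equation = " + ".join(
    f"{coef:.3f}" if name == "Intercept" else f"{coef:.3f}*{name}"
    for name, coef in model.params.items()
)
print("y_hat =", equation)
```

In the notebook, the same pattern should apply after loading prostate.csv with `pd.read_csv` and using lpsa as the response.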
5.1.2 1b)
Is the overall regression statistically significant? In other words, is there a statistically significant relationship between the response and at least one predictor? You need to justify your answer for credit. (2 points)
5.1.3 1c)
What does the optimal coefficient of svi tell us as a numeric output? Make sure you include the predictor (svi), the response (lpsa), and the other predictors in your answer. (2 points)
5.1.4 1d)
Check the \(p\)-values of gleason and age. Are these predictors statistically significant? You need to justify your answer for credit. (2 points)
5.1.5 1e)
Check the 95% Confidence Interval of age. How can you relate it to its p-value and statistical significance, which you found in the previous part? (2 points)
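One way to see the link between a confidence interval and a p-value is to tabulate them side by side. The sketch below uses synthetic data with hypothetical columns `y`, `x1`, and `x2` (not the prostate data) and checks, for each coefficient, whether the 95% interval excludes zero and whether the p-value is below 0.05.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical columns); x2 has no true effect on y.
rng = np.random.default_rng(1)
n = 97
demo = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
demo["y"] = 1.0 + 2.0 * demo["x1"] + rng.normal(size=n)

model = smf.ols("y ~ x1 + x2", data=demo).fit()

# conf_int returns one row per coefficient with the lower and upper bounds.
ci = model.conf_int(alpha=0.05)
ci.columns = ["lower", "upper"]

check = ci.assign(
    p_value=model.pvalues,
    ci_excludes_zero=(ci["lower"] > 0) | (ci["upper"] < 0),
    significant=model.pvalues < 0.05,
)
print(check)
```

Because the interval and the t-test are built from the same estimate and standard error, the last two columns should agree row by row.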
5.1.6 1f)
This question requires some thinking, and bringing your 303-1 and 303-2 knowledge together.
Fit a simple linear regression model on lpsa against gleason and check the \(p\)-value of gleason using the summary. (2 points) Did the statistical significance of gleason change in the absence of other predictors? (1 point) Why or why not? (3 points)
Hints:
- You need to compare this model with the Multiple Linear Regression model you created above.
- Printing a correlation matrix of all the predictors should be useful.
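The comparison suggested in the hints can be sketched as follows. The data here are synthetic and deliberately constructed so that two hypothetical predictors, `x1` and `x2`, are correlated while only `x1` drives `y`; it illustrates the mechanics only and is not a statement about gleason or the prostate data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: x2 is correlated with x1, but only x1 affects y.
rng = np.random.default_rng(2)
n = 97
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
demo = pd.DataFrame({"x1": x1, "x2": x2})
demo["y"] = 2.0 * demo["x1"] + rng.normal(size=n)

# Correlation matrix of the predictors.
corr = demo[["x1", "x2"]].corr()
print(corr)

# Same predictor, with and without the other predictor in the model.
simple = smf.ols("y ~ x2", data=demo).fit()
full = smf.ols("y ~ x1 + x2", data=demo).fit()
print("p-value of x2 alone:       ", simple.pvalues["x2"])
print("p-value of x2 alongside x1:", full.pvalues["x2"])
```

Comparing the two printed p-values with the correlation matrix shows how a predictor's apparent significance can depend on which other predictors are in the model.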
5.1.7 1g)
Predict the lpsa of a 65-year-old man with lcavol = 1.35, lweight = 3.65, lbph = 0.1, svi = 0.22, lcp = -0.18, gleason = 6.75, and pgg45 = 25. Find the 95% confidence and prediction intervals as well. (2 points)
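Point predictions and both kinds of interval can be obtained from a fitted statsmodels model in one call. The sketch below is a minimal example on synthetic data with hypothetical columns `y`, `x1`, and `x2`; the new observation `new_obs` is likewise made up.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical columns), NOT the prostate data.
rng = np.random.default_rng(3)
n = 97
demo = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
demo["y"] = 1.0 + 2.0 * demo["x1"] - 0.5 * demo["x2"] + rng.normal(size=n)

model = smf.ols("y ~ x1 + x2", data=demo).fit()

# One-row data frame whose column names match the predictors in the formula.
new_obs = pd.DataFrame({"x1": [0.5], "x2": [-1.0]})

# mean_ci_* is the confidence interval for the mean response;
# obs_ci_* is the prediction interval for a single new observation.
pred = model.get_prediction(new_obs).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```

The prediction interval is always wider than the confidence interval, since it also accounts for the noise in an individual observation.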
5.1.8 1h)
In the Multiple Linear Regression model with all the predictors, you should see a total of five predictors that appear to be statistically insignificant. Why is it not a good idea to directly conclude that all of them are statistically insignificant? (2 points) Implement the additional test that concludes the statistical insignificance of all five predictors. (2 points)
Hint: f_test() method
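A joint test of several coefficients with `f_test()` can be sketched as below. The data are synthetic, and the columns `y`, `x1`, `x2`, and `x3` are hypothetical; in the assignment, the constraint string would instead name the five predictors in question.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: only x1 truly affects y.
rng = np.random.default_rng(4)
n = 97
demo = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
demo["y"] = 1.0 + 2.0 * demo["x1"] + rng.normal(size=n)

model = smf.ols("y ~ x1 + x2 + x3", data=demo).fit()

# Null hypothesis: the coefficients of x2 and x3 are both zero at the same time.
joint = model.f_test("x2 = 0, x3 = 0")

# np.squeeze guards against the statistic being returned as an array in some versions.
f_stat = float(np.squeeze(joint.fvalue))
p_val = float(np.squeeze(joint.pvalue))
print(f"F = {f_stat:.3f}, p-value = {p_val:.3f}")
```

A single joint test avoids reading several individual t-tests as if they were one combined conclusion.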